Abstract



Examinee response times from a computerized adaptive test were analyzed using a 
hierarchical linear model. Two equations were posed: a within-person model and a between- 
person model. Variance within persons was eight times greater than variance between persons. 
Several variables significantly predicted within-person variance. Response time increased with 
increasing item text length and increasing relative item difficulty. Item sequence was negatively 
related to response time and some content areas required more time than others. Examinees 
spent more time on items they got wrong than on items they got right, and they took longer to 
respond when the correct answer was A, B, or C than when the correct answer was D. Only 
one variable, test anxiety, significantly predicted variance between examinees. Examinee age, 
sex, first language and ethnicity did not predict between-person variance, and low-ability examinees 
did not take longer to respond to items than high-ability examinees. Understanding how item 
characteristics affect response time may allow test developers to allot total test time based 
upon the response time history of the individual test items. This study also suggests that 
examinee characteristics are generally not related to response time, but that more controllable 
factors such as item length, position of the keyed correct answer, and use of figures do 
contribute to response time. 
Computerized Adaptive Testing 
Exploring Examinee Response Time 
Using Hierarchical Linear Modeling 

Computerization allows previously unobtainable data, such as item response time, to be 
collected and used to improve tests. Researchers can now analyze the amount of time that an 
individual takes when answering a test item. Understanding the various factors that affect 
item response time should help test developers predict the amount of time required 
for total test administration. Response time may also be used in the future to help identify 
unusual or cheating behaviors (Kingsbury, Zara and Houser, 1993). For example, an individual 
who responds to items more quickly than the normal response range could be flagged for 
further investigation. 

The arrival of computerized adaptive testing further complicates the role of overall 
test time and individual item time. The archetype of computerized adaptive testing would have the 
test taker answer only as many items as needed to determine the individual’s ability relative to 
a pre-determined pass/fail point or within a specified level of precision. This strategy would 
typically not seek to place any real limits on the amount of time that an individual could spend 
on any single item, and total test time would vary between individuals based upon both the 
number of items required to determine ability and the examinee’s tendency to take more or less 
time in answering items. However, in real testing situations, maximum time limits are usually 
set due to cost or other administrative issues. 

Rafaeli and Tractinsky (1991) found a strong negative correlation between response time 
and accuracy for general knowledge tests, but not for mathematical reasoning tests administered 
by computer. They suggested that in adaptive survey techniques, time could be allocated to each 
item based upon the difficulty of the item and test taker ability. The resulting test could be 
significantly shorter, although corresponding examinee satisfaction ratings were likely to decline. 
Rafaeli and Tractinsky also reported that examinees took longer when time limitations were 
based on the total test than when time limits were placed on each item. Examinees, however, 
reported a preference for a total test time limitation. 

As is the case with all new technologies, there is a general sense of fear regarding the 
differential impact of using a computer to administer tests versus traditional paper and pencil 
methods (Fair Test, 1992). Several studies have addressed the effect of overall test-taking time 
on test scores, and whether differences in time interact with demographic variables 
such as race and sex. Generally, these studies found that increased overall testing time improves 
scores, but they found no significant interactions with race or sex (Wild, Durso, and Rubin, 
1982; Evans and Reilly, 1976). 

Previous regression studies of item response time on computerized adaptive tests have 
left much of the variation in time unexplained (Gershon, Bergstrom and Lunz, 1993; Kingsbury 
et al., 1993). Kingsbury, Zara and Houser (1993) reported that none of the variables they 
investigated accounted for more than 8% of the variance in item response times. 

In our initial study (Gershon et al., 1993), we found several variables to be significant 
predictors of examinee response time per item. Response time increased with increasing item 
text length and increasing item difficulty. Examinees took longer to respond to items at the 
beginning of the test than at the end of the test. Response time also varied by content category, 
whether or not the item contained an illustration, distractor position of the correct response, and 
whether or not the examinee got the item correct. Item level variables accounted for 19% of 
the variance in response time. Examinee variables (test anxiety, gender, ethnic background, age 
and language) accounted for an additional 2% of the variance in response time. 

This previous research treated individual examinee/item response times as independent 
pieces of data when, in fact, item observations are nested within persons. We believe that at 
least some of the unexplained variance in response time can be attributed to individual person 
differences. 

In this study, we reanalyze our original data using a hierarchical linear model (Bryk, 
Raudenbush, Seltzer and Congdon, 1989) which allows us to separate individual variation in 
response time from error variation, giving better estimates of individual variation and permitting 
exploration of the effects of item level characteristics and person demographics on response 
time. We hypothesize that: 

1) Item characteristics such as text length, sequence of the item on a test, presence of a 
figure, and location of the correct answer (a, b, c, d) will significantly predict variance in 
response time. 

2) Response time variance will be greater across candidates than within candidates. 

3) Response time will vary across candidates based on demographic characteristics (age, 
gender, test anxiety, ethnic background, etc.). 

Data and Instruments 

Data were collected from a certification examination administered in 1991 using a 
computerized adaptive algorithm. Examinees had the option of taking a computerized adaptive 
test or the traditional paper and pencil test. Two hundred four examinees chose to take the 
computerized test. 
The computerized test was administered with the CAT ADMINISTRATOR software 
(Gershon, 1990) using the PROX method of estimation (Wright & Stone, 1979) for the item 
selection algorithm. A pre-calibrated item bank consisting of 696 items was prepared for the 
examination. Each test item fit on one computer screen, so examinees did not have to scroll 
item text. Content was distributed across six content areas. A content balancing mechanism, 
to ensure that item distribution matched the test specifications of the traditional written test, was 
included in the computerized adaptive algorithm. 

Each test began with an item randomly chosen from items within .10 logits in difficulty 
of the pass/fail point. The following 9 items were constrained to within .10 logits of the 
previously administered item difficulty. This procedure effectively constrained the difficulty of 
the first 10 items to within 1 logit of the pass/fail point. Thereafter, items were targeted to a 
60% probability of correct response. Items were chosen at random from unused items within 
.10 logits of the targeted item difficulty within the specified content area. 
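
The selection rules above can be sketched in code. The following is a minimal illustration, not the CAT ADMINISTRATOR implementation. It assumes a Rasch model, under which a 60% probability of a correct response corresponds to an item about 0.41 logits easier than the examinee's current ability estimate. The function names and the dictionary layout of the item bank are hypothetical.

```python
import math
import random

WINDOW = 0.10  # half-width of the difficulty window in logits (from the text)


def target_difficulty(ability, p_correct=0.60):
    """Item difficulty giving p_correct under an assumed Rasch model."""
    return ability - math.log(p_correct / (1.0 - p_correct))


def selection_center(position, pass_point, last_difficulty, ability):
    """Center of the difficulty window for the item at a 1-based position."""
    if position == 1:
        return pass_point          # first item: near the pass/fail point
    if position <= 10:
        return last_difficulty     # items 2-10: near the previous item
    return target_difficulty(ability)  # afterwards: 60% success target


def pick_item(bank, used, center, content=None, rng=random):
    """Randomly pick an unused item within WINDOW logits of center.

    bank maps item_id -> (difficulty, content_area); this layout is an
    assumption made for the sketch. Returns None if no item qualifies.
    """
    candidates = [
        item_id
        for item_id, (difficulty, area) in bank.items()
        if item_id not in used
        and abs(difficulty - center) <= WINDOW
        and (content is None or area == content)
    ]
    return rng.choice(candidates) if candidates else None
```

Because each of the nine items after the first may drift at most .10 logits from its predecessor, the first ten items cannot stray more than 1 logit from the pass/fail point, which matches the constraint stated above.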

Testing stopped when the estimated examinee ability was 1.65 times the standard error 
of measurement above or below the pass point. Minimum test length was 50 items and 
maximum test length was 100 items. Examinees were allotted 2 hours to complete the test. 
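
The stopping rule can be expressed as a short function. This is a sketch under stated assumptions: the confidence criterion is read as "at least 1.65 standard errors from the pass point", it is checked only once the minimum length is reached, and the 2-hour time limit is not modeled. The function and parameter names are hypothetical.

```python
def should_stop(ability, sem, pass_point, n_items,
                min_items=50, max_items=100, z=1.65):
    """Return True when testing should end for this examinee."""
    if n_items >= max_items:
        return True   # maximum test length reached
    if n_items < min_items:
        return False  # minimum test length not yet reached
    # Pass/fail decision is clear when the ability estimate lies at least
    # z standard errors above or below the pass point.
    return abs(ability - pass_point) >= z * sem
```

For example, with a pass point of 0 logits and a standard error of 0.3, an ability estimate of 1.0 after 60 items ends the test, while an estimate of 0.2 does not.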

Some items contained figures or color plates. These graphics were contained in a 
separate illustration booklet. When this type of item appeared on the screen, examinees were 
instructed to refer to a specific illustration in the booklet. 

Upon completion of the adaptive test, examinees were allowed to review and change 
answers. This paper, however, deals only with response time during the initial test 
administration. Analysis of review times is not included. (See Lunz, Bergstrom and Wright, 
1992, for further information on the effect of allowing review.) 

Method 

Response time was recorded from the time the item appeared on the screen until the 
examinee pressed enter and moved to the next item. Figure 1 shows the distributions of 
response time and the log of response time. The mean time per item was 63 seconds with a 
standard deviation of 47 seconds. For this analysis, the log of response time better approximates 
a normally distributed random variable and therefore was used as the dependent variable. 
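
To illustrate why the log transformation helps, the sketch below simulates response times; it does not use the study data. It assumes, purely for illustration, that times are lognormal with the reported mean (63 seconds) and standard deviation (47 seconds), and compares the skewness of the raw and logged values.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


MEAN_SEC, SD_SEC = 63.0, 47.0  # summary statistics reported in the text

# Lognormal parameters matching that mean and SD (illustrative assumption).
sigma2 = math.log(1.0 + (SD_SEC / MEAN_SEC) ** 2)
mu = math.log(MEAN_SEC) - sigma2 / 2.0

rng = random.Random(0)  # fixed seed so the simulation is repeatable
times = [rng.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(5000)]
log_times = [math.log(t) for t in times]

raw_skew = skewness(times)      # strongly right-skewed
log_skew = skewness(log_times)  # close to symmetric
```

A standard deviation that is large relative to the mean, with times bounded below by zero, is the typical signature of a right-skewed variable, which is why the logged values are the more natural choice for a model with normally distributed errors.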



INSERT FIGURE 1 HERE 



Data were analyzed with a hierarchical linear model (Bryk et al., 1989). Computations 
were performed with a software program for fitting hierarchical linear models (Thum, 1994). 
With the hierarchical linear model we pose two equations, a within-person and a between-person 
model. The within-person model specifies the relationships between $t_{ij}$, the log of the observed 
response time on item i for examinee j, and various independent variables $X_{ijp}$. Thus the 
within-person model can be written as: 

$$t_{ij} = \beta_{j0} + \beta_{j1}X_{ij1} + \beta_{j2}X_{ij2} + \beta_{j3}X_{ij3} + \beta_{j4}X_{ij4} + \beta_{j5}X_{ij5} + \beta_{j6}X_{ij6} + \beta_{j7}X_{ij7} + \beta_{j8}X_{ij8} + \beta_{j9}X_{ij9} + \beta_{j10}X_{ij10} + \beta_{j11}X_{ij11} + \beta_{j12}X_{ij12} + \beta_{j13}X_{ij13} + R_{ij} \qquad (1)$$
where 

$t_{ij}$ is the log of the response time on item i for examinee j; 
$X_{ij1}$ is the relative difficulty of item i for examinee j (see footnote 1); 
$X_{ij2}$ is the correctness of response of item i for examinee j (0 = wrong, 1 = right); 
$X_{ij3}$ is the position of administration (sequence) of item i for examinee j; 
$X_{ij4}$ is the item length (see footnote 2); 
$X_{ij5}$ is the content category of item i for examinee j (0 = Content 6, 1 = Content 1); 
$X_{ij6}$ is the content category of item i for examinee j (0 = Content 6, 1 = Content 2); 
$X_{ij7}$ is the content category of item i for examinee j (0 = Content 6, 1 = Content 3); 
$X_{ij8}$ is the content category of item i for examinee j (0 = Content 6, 1 = Content 4); 
$X_{ij9}$ is the content category of item i for examinee j (0 = Content 6, 1 = Content 5); 
$X_{ij10}$ is the position of the correct answer of item i for examinee j (0 = D, 1 = A); 
$X_{ij11}$ is the position of the correct answer of item i for examinee j (0 = D, 1 = B); 
$X_{ij12}$ is the position of the correct answer of item i for examinee j (0 = D, 1 = C); 
$X_{ij13}$ is the figure status of item i for examinee j (0 = no figure, 1 = figure); 
$R_{ij}$ represents random error; 

and 

$\beta_{jp}$ are regression coefficients that characterize the structural relationship within 
person 



1 Relative item difficulty is calculated as final estimate of examinee ability minus item difficulty. Two additional variables were 
considered for the analysis: examinee ability and item difficulty. Due to the targeting of computerized adaptive tests, all three variables are 
highly correlated. 

2 Item length was calculated as the number of characters in the item including stem and all distractors. 
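
The coding of the thirteen within-person predictors can be made concrete with a short sketch. It follows the definitions given above: relative difficulty as ability minus item difficulty, item length as a character count, Content 6 and answer position D as reference categories. The function names, argument names, and the idea of passing the full item text as one string are assumptions made for illustration, not part of the original analysis software.

```python
def design_row(ability, item_difficulty, correct, sequence,
               item_text, content_area, key, has_figure):
    """Build the 13 predictors X_ij1..X_ij13 for one item response."""
    relative_difficulty = ability - item_difficulty  # footnote 1
    length = len(item_text)  # footnote 2: characters in stem and options
    # Content areas 1-5 are dummy coded; Content 6 is the reference (all 0).
    content_dummies = [1 if content_area == c else 0 for c in range(1, 6)]
    # Keyed positions A-C are dummy coded; D is the reference (all 0).
    key_dummies = [1 if key == k else 0 for k in ("A", "B", "C")]
    return [relative_difficulty, int(correct), sequence, length,
            *content_dummies, *key_dummies, int(has_figure)]


def predict_log_time(betas, row):
    """Evaluate the within-person model without the error term.

    betas holds the 14 coefficients for one examinee, intercept first.
    """
    return betas[0] + sum(b * x for b, x in zip(betas[1:], row))
```

With this coding, each content coefficient is read as the difference in log response time between that content area and Content 6, and each answer-position coefficient as the difference from items keyed D.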



